Background:
SCCU(Swire Coca-Cola United States) tries to optimize logistics by transitioning customers selling below a specific annual volume to an Alternate Route to Market (ARTM). There is an annual 400 gallons volume threshold used to distinguish the customers between the direct delivery route and ARTM. However, SCCU is looking for a more cost-efficient strategy to decide new threshold for optimizing logistics which is driving better operational efficiency and more revenues.
Requirement:
The analysis will focus on classifying which customers must be included in ARTM or Direct route, and which volume threshold would be optimal to decide for the classification.
The analysis will focus on two key customer segments.
Questions:
What factors or characteristics distinguish customers with annual sales exceeding the determined volume threshold from those below this threshold?
How can SCCU uses historical sales data, or other Customer Characteristics to predict which ARTM customers have the potential to grow beyond the volume threshold annually?
How can these insights be integrated into the routing strategy to support long-term growth while maintaining logistical efficiency?
What levers can be employed to accelerate volume and share growth at growth-ready, high-potential customers?
# import libraries
library(tidyverse)
library(janitor)
library(skimr)
library(psych)
library(glue)
library(here)
library(readxl)
# import datasets
address_df<- read_csv(here("Dataset",
"customer_address_and_zip_mapping.csv"))
profile_df <- read_csv(here("Dataset","customer_profile.csv"))
delivery_cost_df <- read_xlsx(here("Dataset","delivery_cost_data.xlsx"))
trans_df <- read_csv(here("Dataset","transactional_data.csv"))
Variables can be described as below.
sample_n(address_df, 10)
Variables can be described as below.
address_dfsample_n(profile_df,10)
Variables can be described as below.
sample_n(delivery_cost_df,10)
Variables can be described as below.
sample_n(trans_df,10)
skim(address_df)
| Name | address_df |
| Number of rows | 1801 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| full address | 0 | 1 | 45 | 73 | 0 | 1801 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zip | 0 | 1 | 28919.81 | 25588.64 | 1001 | 2153 | 21634 | 42440 | 71483 | ▇▃▅▁▅ |
skim(profile_df)
| Name | profile_df |
| Number of rows | 30478 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| logical | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| FREQUENT_ORDER_TYPE | 0 | 1 | 3 | 13 | 0 | 6 | 0 |
| FIRST_DELIVERY_DATE | 0 | 1 | 8 | 10 | 0 | 2401 | 0 |
| ON_BOARDING_DATE | 0 | 1 | 8 | 10 | 0 | 6487 | 0 |
| COLD_DRINK_CHANNEL | 0 | 1 | 5 | 13 | 0 | 9 | 0 |
| TRADE_CHANNEL | 0 | 1 | 6 | 28 | 0 | 26 | 0 |
| SUB_TRADE_CHANNEL | 0 | 1 | 4 | 27 | 0 | 48 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| LOCAL_MARKET_PARTNER | 0 | 1 | 0.90 | TRU: 27355, FAL: 3123 |
| CO2_CUSTOMER | 0 | 1 | 0.39 | FAL: 18496, TRU: 11982 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CUSTOMER_NUMBER | 0 | 1.0 | 538301800.92 | 47950644.47 | 500245678 | 501164306 | 501573995 | 600075795 | 600975408 | ▇▁▁▁▅ |
| PRIMARY_GROUP_NUMBER | 18196 | 0.4 | 2779.85 | 2608.64 | 4 | 444 | 1892 | 4488 | 9999 | ▇▃▂▁▁ |
| ZIP_CODE | 0 | 1.0 | 30252.25 | 25953.08 | 1001 | 2155 | 21771 | 42762 | 71483 | ▇▃▅▁▆ |
skim(delivery_cost_df)
| Name | delivery_cost_df |
| Number of rows | 160 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Cold Drink Channel | 0 | 1 | 5 | 13 | 0 | 8 | 0 |
| Vol Range | 0 | 1 | 5 | 11 | 0 | 10 | 0 |
| Applicable To | 0 | 1 | 8 | 16 | 0 | 2 | 0 |
| Cost Type | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Median Delivery Cost | 0 | 1 | 2.6 | 1.71 | 0.37 | 1.33 | 2.24 | 3.47 | 8.59 | ▇▆▂▁▁ |
skim(trans_df)
| Name | trans_df |
| Number of rows | 1045540 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| TRANSACTION_DATE | 0 | 1 | 8 | 10 | 0 | 723 | 0 |
| ORDER_TYPE | 0 | 1 | 3 | 13 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| WEEK | 0 | 1 | 26.23 | 14.52 | 1.0 | 14 | 26 | 38.00 | 52.00 | ▇▇▇▇▇ |
| YEAR | 0 | 1 | 2023.50 | 0.50 | 2023.0 | 2023 | 2023 | 2024.00 | 2024.00 | ▇▁▁▁▇ |
| CUSTOMER_NUMBER | 0 | 1 | 546643776.32 | 49426585.56 | 500245678.0 | 501091920 | 501548213 | 600080939.00 | 600975408.00 | ▇▁▁▁▇ |
| ORDERED_CASES | 0 | 1 | 26.85 | 126.76 | 0.0 | 0 | 7 | 18.50 | 8479.89 | ▇▁▁▁▁ |
| LOADED_CASES | 0 | 1 | 25.92 | 122.79 | 0.0 | 0 | 7 | 18.00 | 8171.56 | ▇▁▁▁▁ |
| DELIVERED_CASES | 0 | 1 | 25.13 | 121.52 | -3132.0 | 0 | 6 | 17.33 | 8069.48 | ▁▇▁▁▁ |
| ORDERED_GALLONS | 0 | 1 | 9.87 | 26.47 | 0.0 | 0 | 0 | 12.50 | 2562.50 | ▇▁▁▁▁ |
| LOADED_GALLONS | 0 | 1 | 9.60 | 25.65 | 0.0 | 0 | 0 | 12.50 | 2562.50 | ▇▁▁▁▁ |
| DELIVERED_GALLONS | 0 | 1 | 9.21 | 25.18 | -1792.5 | 0 | 0 | 12.50 | 2292.50 | ▁▁▇▁▁ |
colSums(is.na(address_df))
## zip full address
## 0 0
colSums(is.na(profile_df))
## CUSTOMER_NUMBER PRIMARY_GROUP_NUMBER FREQUENT_ORDER_TYPE
## 0 18196 0
## FIRST_DELIVERY_DATE ON_BOARDING_DATE COLD_DRINK_CHANNEL
## 0 0 0
## TRADE_CHANNEL SUB_TRADE_CHANNEL LOCAL_MARKET_PARTNER
## 0 0 0
## CO2_CUSTOMER ZIP_CODE
## 0 0
colSums(is.na(delivery_cost_df))
## Cold Drink Channel Vol Range Applicable To
## 0 0 0
## Median Delivery Cost Cost Type
## 0 0
colSums(is.na(trans_df))
## TRANSACTION_DATE WEEK YEAR CUSTOMER_NUMBER
## 0 0 0 0
## ORDER_TYPE ORDERED_CASES LOADED_CASES DELIVERED_CASES
## 0 0 0 0
## ORDERED_GALLONS LOADED_GALLONS DELIVERED_GALLONS
## 0 0 0
PRIMARY_GROUP_NUMBER has a 18196 missing values, which
takes up 60% of profile_df dataset.How many customers are partnered with Local Market Partners out of the entire customers?
How many customers are purchasing C02 products out of entire customers?
Which number can we extract out of transaction history?
How many customers belongs to the direct route based on the original volume threshold? And how many customers belong to the ARTM based on the original volume threshold?
Which customer characteristics have brought more profits from given transaction data?
How many customers belongs to the Local Market Partners that buy fountains only? (Group Segment 1)
How many Customers moved above and below the Threshold from 2023 to 2024?
What is the Net change in customers moving between threshold categories? (Low Volume, Medium Volume, High Volume)
What percentage of customers upgraded or downgraded between categories?
What are the key patterns in customer order volume reduction from 2023 to 2024.
Among customers who reduced their order volume , what is the average percentage drop?
# the distribution of local market partner customers out of entire customers
table(profile_df$LOCAL_MARKET_PARTNER)
##
## FALSE TRUE
## 3123 27355
round(prop.table(table(profile_df$LOCAL_MARKET_PARTNER)),2)
##
## FALSE TRUE
## 0.1 0.9
Approximately, 90% of listed customers belong to the local market partners, which indicates that they are smaller, regionally focused customers who serve their local communities. They tend to show their reliance on local market dynamics and consistent purchasing patterns.
# the distribution of CO2 customers out of entire customers
table(profile_df$CO2_CUSTOMER)
##
## FALSE TRUE
## 18496 11982
round(prop.table(table(profile_df$CO2_CUSTOMER)),2)
##
## FALSE TRUE
## 0.61 0.39
Approximately, 40% of listed customer belongs to the CO2 customer, which represents that they have purchased carbon dioxide materials.
trans_df %>%
summarise(customer_n = n_distinct(CUSTOMER_NUMBER))
trans_df %>%
summarise(case_volume = sum(ORDERED_CASES),
gallon_volume = sum(ORDERED_GALLONS),
total_volume = case_volume + gallon_volume)
max(as.Date(trans_df$TRANSACTION_DATE, format="%m/%d/%Y"))
## [1] "2024-12-31"
min(as.Date(trans_df$TRANSACTION_DATE, format="%m/%d/%Y"))
## [1] "2023-01-01"
30322 customers have transacted 28,074,470 cases and 10,323,337 gallons (total 38,397,807 units) with SCCU from 1/1/2023 to 12/31/2024. (2 years)
trans_history <-
trans_df %>%
mutate(TRANSACTION_DATE = as.Date(TRANSACTION_DATE, format="%m/%d/%Y")) %>%
#mutate(CUSTOMER_NUMBER = as.integer(CUSTOMER_NUMBER)) %>%
group_by(CUSTOMER_NUMBER) %>%
summarise(
FIRST_TRANSACTION_DATE = min(TRANSACTION_DATE),
LAST_TRANSACTION_DATE = max(TRANSACTION_DATE),
TRANS_DAYS = LAST_TRANSACTION_DATE - FIRST_TRANSACTION_DATE + 1,
TRANS_COUNT = n(),
TRANS_COUNT_2023 = sum((year(TRANSACTION_DATE) == 2023)),
TRANS_COUNT_2024 = sum((year(TRANSACTION_DATE) == 2024)),
ANNUAL_VOLUME_CASES_2023 = sum((year(TRANSACTION_DATE) == 2023) * ORDERED_CASES, na.rm = TRUE),
ANNUAL_VOLUME_GALLON_2023 = sum((year(TRANSACTION_DATE) == 2023) * ORDERED_GALLONS, na.rm = TRUE),
ANNUAL_VOLUME_CASES_2024 = sum((year(TRANSACTION_DATE) == 2024) * ORDERED_CASES, na.rm = TRUE),
ANNUAL_VOLUME_GALLON_2024 = sum((year(TRANSACTION_DATE) == 2024) * ORDERED_GALLONS, na.rm = TRUE),
ANNUAL_VOLUME_2023 = sum((year(TRANSACTION_DATE) == 2023) * (ORDERED_CASES + ORDERED_GALLONS), na.rm = TRUE),
AVG_ORDER_VOLUME_2023 = ANNUAL_VOLUME_2023 / TRANS_COUNT_2023,
ANNUAL_VOLUME_2024 = sum((year(TRANSACTION_DATE) == 2024) * (ORDERED_CASES + ORDERED_GALLONS), na.rm = TRUE),
AVG_ORDER_VOLUME_2024 = ANNUAL_VOLUME_2024 / TRANS_COUNT_2024,
CHANGED_VOLUME = ANNUAL_VOLUME_2024 - ANNUAL_VOLUME_2023,
PERCENT_CHANGE = round(CHANGED_VOLUME/ANNUAL_VOLUME_2023,2) * 100,
THRESHOLD_2023 = ifelse(ANNUAL_VOLUME_2023 >= 400, 'above', 'below'),
THRESHOLD_2024 = ifelse(ANNUAL_VOLUME_2024 >= 400, 'above', 'below'),
) %>%
ungroup()
trans_history
colSums(is.na(trans_history))
## CUSTOMER_NUMBER FIRST_TRANSACTION_DATE LAST_TRANSACTION_DATE
## 0 0 0
## TRANS_DAYS TRANS_COUNT TRANS_COUNT_2023
## 0 0 0
## TRANS_COUNT_2024 ANNUAL_VOLUME_CASES_2023 ANNUAL_VOLUME_GALLON_2023
## 0 0 0
## ANNUAL_VOLUME_CASES_2024 ANNUAL_VOLUME_GALLON_2024 ANNUAL_VOLUME_2023
## 0 0 0
## AVG_ORDER_VOLUME_2023 ANNUAL_VOLUME_2024 AVG_ORDER_VOLUME_2024
## 4270 0 721
## CHANGED_VOLUME PERCENT_CHANGE THRESHOLD_2023
## 0 137 0
## THRESHOLD_2024
## 0
# 2023 above vs below threshold
table(trans_history$THRESHOLD_2023)
##
## above below
## 7745 22577
prop.table(table(trans_history$THRESHOLD_2023))
##
## above below
## 0.2554251 0.7445749
# 2024 above vs below threshold
table(trans_history$THRESHOLD_2024)
##
## above below
## 7867 22455
prop.table(table(trans_history$THRESHOLD_2024))
##
## above below
## 0.2594486 0.7405514
thres_change_customer <-
trans_history %>%
filter(THRESHOLD_2023 != THRESHOLD_2024)
thres_change_customer
table(thres_change_customer$THRESHOLD_2023, thres_change_customer$THRESHOLD_2024)
##
## above below
## above 0 1128
## below 1250 0
round(prop.table(table(thres_change_customer$THRESHOLD_2023, thres_change_customer$THRESHOLD_2024)),2)
##
## above below
## above 0.00 0.47
## below 0.53 0.00
However, when we get into the depth, 2,378 (8%) customers experienced a change in volume based on the original volume threshold from 2023 to 2024 out of 30,322 total customers. Among them, 1,250 customers (around 4%) exceeded the threshold in 2024 from below threshold status, whereas 1,128 (around 4%) customers drops below the threshold.
# total customer growth statistics
trans_history %>%
summarise(AVG_CHANGE_VOL = mean(CHANGED_VOLUME),
MED_CHANGE_VOL = median(CHANGED_VOLUME),
MIN_CHANGE_VOL = min(CHANGED_VOLUME),
MAX_CHANGE_VOL = max(CHANGED_VOLUME))
# below in both year growth statistics
trans_history %>%
filter(THRESHOLD_2023 == 'below' & THRESHOLD_2024 == 'below') %>%
summarise(AVG_CHANGE_VOL = mean(CHANGED_VOLUME),
MED_CHANGE_VOL = median(CHANGED_VOLUME),
MIN_CHANGE_VOL = min(CHANGED_VOLUME),
MAX_CHANGE_VOL = max(CHANGED_VOLUME))
# above in both year growth statistics
trans_history %>%
filter(THRESHOLD_2023 == 'above' & THRESHOLD_2024 == 'above') %>%
summarise(AVG_CHANGE_VOL = mean(CHANGED_VOLUME),
MED_CHANGE_VOL = median(CHANGED_VOLUME),
MIN_CHANGE_VOL = min(CHANGED_VOLUME),
MAX_CHANGE_VOL = max(CHANGED_VOLUME))
# potential growth customer statistics
trans_history %>%
filter(THRESHOLD_2023 == 'below' & THRESHOLD_2024 == 'above') %>%
summarise(AVG_CHANGE_VOL = mean(CHANGED_VOLUME),
MED_CHANGE_VOL = median(CHANGED_VOLUME),
MIN_CHANGE_VOL = min(CHANGED_VOLUME),
MAX_CHANGE_VOL = max(CHANGED_VOLUME))
# total customer
trans_history %>%
ggplot() +
geom_boxplot(aes(x = PERCENT_CHANGE)) +
theme_minimal()
# both below customer
trans_history %>%
filter(THRESHOLD_2023 == 'below' & THRESHOLD_2024 == 'below') %>%
ggplot() +
geom_boxplot(aes(x = PERCENT_CHANGE), na.rm = TRUE) +
theme_minimal()
# both above customer
trans_history %>%
filter(THRESHOLD_2023 == 'below' & THRESHOLD_2024 == 'below') %>%
ggplot() +
geom_boxplot(aes(x = PERCENT_CHANGE), na.rm = TRUE) +
theme_minimal()
# potential growth customer
trans_history %>%
filter(THRESHOLD_2023 == 'below' & THRESHOLD_2024 == 'above') %>%
ggplot() +
geom_boxplot(aes(x = PERCENT_CHANGE)) +
theme_minimal()
In order to take in-depth analysis per each of customer’s attributes,
we’ve combined the customer profile profile_df data with
trans_history , joined by CUSTOMER_NUMBER
variable.
trans_profile_df <- left_join(trans_history, profile_df, by = 'CUSTOMER_NUMBER')
sample_n(trans_profile_df,10)
volume_2023 <- sum(trans_profile_df$ANNUAL_VOLUME_2023, na.rm = TRUE)
volume_2024 <- sum(trans_profile_df$ANNUAL_VOLUME_2024, na.rm = TRUE)
trans_profile_df %>%
group_by(LOCAL_MARKET_PARTNER) %>%
summarise(TOTAL_VOL_2023 = sum(ANNUAL_VOLUME_2023),
TOTAL_VOL_2024 = sum(ANNUAL_VOLUME_2024),
PERCENT_2023 = (TOTAL_VOL_2023 / volume_2023) * 100,
PERCENT_2024 = (TOTAL_VOL_2024 / volume_2024) * 100,
AVG_VOL_2023 = mean(ANNUAL_VOLUME_2023),
AVG_VOL_2024 = mean(ANNUAL_VOLUME_2024),
MED_VOL_2023 = median(ANNUAL_VOLUME_2023),
MED_VOL_2024 = median(ANNUAL_VOLUME_2024),
COUNT_2023 = sum(TRANS_COUNT_2023),
COUNT_2024 = sum(TRANS_COUNT_2024),
ABOVE_THRES_2023 = sum(THRESHOLD_2023 == 'above'),
ABOVE_THRES_2024 = sum(THRESHOLD_2024 == 'above')
)
trans_profile_df %>%
group_by(CO2_CUSTOMER) %>%
summarise(TOTAL_VOL_2023 = sum(ANNUAL_VOLUME_2023),
TOTAL_VOL_2024 = sum(ANNUAL_VOLUME_2024),
PERCENT_2023 = (TOTAL_VOL_2023 / volume_2023) * 100,
PERCENT_2024 = (TOTAL_VOL_2024 / volume_2024) * 100,
AVG_VOL_2023 = mean(ANNUAL_VOLUME_2023),
AVG_VOL_2024 = mean(ANNUAL_VOLUME_2024),
MED_VOL_2023 = median(ANNUAL_VOLUME_2023),
MED_VOL_2024 = median(ANNUAL_VOLUME_2024),
COUNT_2023 = sum(TRANS_COUNT_2023),
COUNT_2024 = sum(TRANS_COUNT_2024),
ABOVE_THRES_2023 = sum(THRESHOLD_2023 == 'above'),
ABOVE_THRES_2024 = sum(THRESHOLD_2024 == 'above')
)
trans_profile_df %>%
group_by(FREQUENT_ORDER_TYPE) %>%
summarise(TOTAL_VOL_2023 = sum(ANNUAL_VOLUME_2023),
TOTAL_VOL_2024 = sum(ANNUAL_VOLUME_2024),
PERCENT_2023 = (TOTAL_VOL_2023 / volume_2023) * 100,
PERCENT_2024 = (TOTAL_VOL_2024 / volume_2024) * 100,
AVG_VOL_2023 = mean(ANNUAL_VOLUME_2023),
AVG_VOL_2024 = mean(ANNUAL_VOLUME_2024),
MED_VOL_2023 = median(ANNUAL_VOLUME_2023),
MED_VOL_2024 = median(ANNUAL_VOLUME_2024),
COUNT_2023 = sum(TRANS_COUNT_2023),
COUNT_2024 = sum(TRANS_COUNT_2024),
ABOVE_THRES_2023 = sum(THRESHOLD_2023 == 'above'),
ABOVE_THRES_2024 = sum(THRESHOLD_2024 == 'above')
)
trans_profile_df %>%
group_by(COLD_DRINK_CHANNEL) %>%
summarise(TOTAL_VOL_2023 = sum(ANNUAL_VOLUME_2023),
TOTAL_VOL_2024 = sum(ANNUAL_VOLUME_2024),
PERCENT_2023 = (TOTAL_VOL_2023 / volume_2023) * 100,
PERCENT_2024 = (TOTAL_VOL_2024 / volume_2024) * 100,
AVG_VOL_2023 = mean(ANNUAL_VOLUME_2023),
AVG_VOL_2024 = mean(ANNUAL_VOLUME_2024),
MED_VOL_2023 = median(ANNUAL_VOLUME_2023),
MED_VOL_2024 = median(ANNUAL_VOLUME_2024),
COUNT_2023 = sum(TRANS_COUNT_2023),
COUNT_2024 = sum(TRANS_COUNT_2024),
ABOVE_THRES_2023 = sum(THRESHOLD_2023 == 'above'),
ABOVE_THRES_2024 = sum(THRESHOLD_2024 == 'above')
)
# Group 1: Local Market Partners that buy fountains only
group1_df <-
trans_profile_df %>%
filter(!CO2_CUSTOMER
& LOCAL_MARKET_PARTNER
& ANNUAL_VOLUME_CASES_2023 == 0
& ANNUAL_VOLUME_CASES_2024 == 0)
group1_df %>%
summarise(TOTAL_VOLUME_2023 = sum(ANNUAL_VOLUME_GALLON_2023),
TOTAL_VOLUME_2024 = sum(ANNUAL_VOLUME_GALLON_2024),
ABOVE_THRES_2023 = sum(THRESHOLD_2023 == 'above'),
ABOVE_THRES_2024 = sum(THRESHOLD_2024 == 'above'))
# Define threshold (400 gallons)
threshold <- 400
# Filter data for 2023 and 2024 only
transaction_filtered <- trans_df %>%
filter(YEAR %in% c(2023, 2024))
# Summarize transactions per customer per year
customer_summary <- transaction_filtered %>%
group_by(CUSTOMER_NUMBER, YEAR) %>%
summarise(
Total_Ordered_Cases = sum(ORDERED_CASES, na.rm = TRUE),
Total_Ordered_Gallons = sum(ORDERED_GALLONS, na.rm = TRUE),
Order_Frequency = n(),
.groups = "drop"
) %>%
# Add Total Volume Calculation
mutate(
Total_Volume = Total_Ordered_Cases + Total_Ordered_Gallons,
Customer_Category = ifelse(Total_Ordered_Gallons >= threshold, "Above Threshold", "Below Threshold")
) %>%
# Volume Segmentation
mutate(
Volume_Segment = case_when(
Total_Volume >= 1000 ~ "High Volume",
Total_Volume >= 500 ~ "Medium Volume",
TRUE ~ "Low Volume"
)
) %>%
# missing values
mutate(
Customer_Category = replace_na(Customer_Category, "Unknown"),
Volume_Segment = replace_na(Volume_Segment, "Unknown")
)
# Customers who changed from 2023-2024
threshold_change_customers <- customer_summary %>%
select(CUSTOMER_NUMBER, YEAR, Customer_Category) %>%
pivot_wider(names_from = YEAR, values_from = Customer_Category, values_fill = list(Customer_Category = "No Purchase")) %>%
rename(Threshold_2023 = `2023`, Threshold_2024 = `2024`) %>%
filter(Threshold_2023 != Threshold_2024)
threshold_transition_summary <- threshold_change_customers %>%
group_by(Threshold_2023, Threshold_2024) %>%
summarise(Customers_Transitioned = n(), .groups = "drop")
# Calculate net change in threshold categories
net_change_summary <- threshold_transition_summary %>%
mutate(Change = case_when(
Threshold_2023 == "Below Threshold" & Threshold_2024 == "Above Threshold" ~ Customers_Transitioned,
Threshold_2023 == "Above Threshold" & Threshold_2024 == "Below Threshold" ~ -Customers_Transitioned,
TRUE ~ 0
)) %>%
summarise(Net_Change = sum(Change))
# Track Customers Who Changed Volume Segments (Low/Medium/High)
volume_change_customers <- customer_summary %>%
select(CUSTOMER_NUMBER, YEAR, Volume_Segment, Total_Volume, Order_Frequency) %>%
pivot_wider(names_from = YEAR, values_from = c(Volume_Segment, Total_Volume, Order_Frequency),
values_fill = list(Volume_Segment = "No Purchase", Total_Volume = 0, Order_Frequency = 0)) %>%
rename(Volume_2023 = Volume_Segment_2023, Volume_2024 = Volume_Segment_2024,
Volume_Ordered_2023 = Total_Volume_2023, Volume_Ordered_2024 = Total_Volume_2024,
Order_Frequency_2023 = Order_Frequency_2023, Order_Frequency_2024 = Order_Frequency_2024)
# Identify customers with consistent or sporadic increases
volume_growth_analysis <- volume_change_customers %>%
filter(Volume_2023 != "No Purchase" & Volume_2024 != "No Purchase" & Volume_2023 != Volume_2024) %>%
mutate(Volume_Growth_Trend = case_when(
Volume_Ordered_2024 > Volume_Ordered_2023 ~ "Consistent Growth",
Volume_Ordered_2024 < Volume_Ordered_2023 ~ "Fluctuating",
TRUE ~ "Stable"
))
# Identify patterns in customer order frequency changes
order_frequency_analysis <- volume_change_customers %>%
mutate(Frequency_Change = Order_Frequency_2024 - Order_Frequency_2023,
Frequency_Pattern = case_when(
Frequency_Change > 0 ~ "Increasing Frequency",
Frequency_Change < 0 ~ "Decreasing Frequency",
TRUE ~ "Stable Frequency"
))
# Calculate average percentage drop for customers who reduced order volume
order_volume_drop_analysis <- volume_change_customers %>%
filter(Volume_Ordered_2023 > 0 & Volume_Ordered_2024 < Volume_Ordered_2023) %>%
mutate(Percentage_Drop = case_when(
Volume_Ordered_2023 > 0 ~ ((Volume_Ordered_2023 - Volume_Ordered_2024) / Volume_Ordered_2023) * 100,
TRUE ~ NA_real_ # Avoid division by zero
))
# Print summaries
print(order_frequency_analysis)
## # A tibble: 30,322 × 9
## CUSTOMER_NUMBER Volume_2023 Volume_2024 Volume_Ordered_2023
## <dbl> <chr> <chr> <dbl>
## 1 500245678 Low Volume Low Volume 370
## 2 500245685 Medium Volume Low Volume 602.
## 3 500245686 Low Volume Low Volume 17.5
## 4 500245687 Low Volume Low Volume 125
## 5 500245689 Medium Volume Medium Volume 546.
## 6 500245690 Low Volume Low Volume 325
## 7 500245695 High Volume Medium Volume 1038.
## 8 500245698 Low Volume High Volume 282
## 9 500245701 Low Volume Low Volume 388
## 10 500245704 High Volume High Volume 1585
## # ℹ 30,312 more rows
## # ℹ 5 more variables: Volume_Ordered_2024 <dbl>, Order_Frequency_2023 <int>,
## # Order_Frequency_2024 <int>, Frequency_Change <int>, Frequency_Pattern <chr>
print(order_volume_drop_analysis)
## # A tibble: 14,742 × 8
## CUSTOMER_NUMBER Volume_2023 Volume_2024 Volume_Ordered_2023
## <dbl> <chr> <chr> <dbl>
## 1 500245685 Medium Volume Low Volume 602.
## 2 500245690 Low Volume Low Volume 325
## 3 500245695 High Volume Medium Volume 1038.
## 4 500245701 Low Volume Low Volume 388
## 5 500245704 High Volume High Volume 1585
## 6 500245725 High Volume Medium Volume 1015
## 7 500245726 Low Volume Low Volume 60
## 8 500245732 Low Volume Low Volume 25
## 9 500245740 Low Volume Low Volume 129.
## 10 500245765 Low Volume Low Volume 139
## # ℹ 14,732 more rows
## # ℹ 4 more variables: Volume_Ordered_2024 <dbl>, Order_Frequency_2023 <int>,
## # Order_Frequency_2024 <int>, Percentage_Drop <dbl>
# Visualization for Order Volume Drop Distribution
ggplot(order_volume_drop_analysis, aes(x = Percentage_Drop)) +
geom_histogram(binwidth = 5, fill = "purple", alpha = 0.7, color = "black") +
theme_minimal() +
labs(title = "Distribution of Order Volume Drop (2023 → 2024)",
x = "Percentage Drop in Order Volume",
y = "Number of Customers")
ggplot(order_frequency_analysis, aes(x = Frequency_Pattern, fill = Frequency_Pattern)) +
geom_bar() +
theme_minimal() +
labs(title = "Customer Order Frequency Changes (2023 → 2024)",
x = "Order Frequency Pattern",
y = "Number of Customers",
fill = "Frequency Change") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(volume_growth_analysis, aes(x = Volume_Growth_Trend, fill = Volume_Growth_Trend)) +
geom_bar() +
theme_minimal() +
labs(title = "Customer Volume Growth Trends (2023 → 2024)",
x = "Growth Trend",
y = "Number of Customers",
fill = "Trend Type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Out of EDA, we could find out below insights
There are 30,322 unique customers from 2023/01/01 to 2024/12/31. (2 years) out of transaction history data.
Even if approximately 90% of customers belongs to the Local Market Customers, their total volume of transaction takes up 72% of entire transaction volumes.
Even though there is not much change of ordering pattern between CO2 customer and Non CO2 customer in 2023 and 2024, median volume per order has increased by over 10% in 2024 compared to 2023 for C02 customer.
SALES REP (sales representatives) remains in 75% of order type for 2 years transactions, followed by OTHERS, and MYCOKE360 (Digital Ordering Platform), which indicates that personal interaction is still significant to maintain the sales.
In terms of order volume percentage per year, Goods channel increase by 2% points from 2023 to 2024.
BULK TRADES and DINING takes over 50% of entire transaction volume in both 2023 and 2024.
14,742 customers experienced a decline in order volume, including some high-volume customers moving to medium or low volume. Growth segment: Certain customers moved from low to high volume, indicating rising demand and potential need for priority servicing.
Some customers crossed above or below the 400-gallon threshold, affecting route efficiency and delivery planning. Net Impact: Helps assess whether SCCU should expand direct delivery routes or refine ARTM logistics.
Increased order frequency suggests growth potential, while decreased frequency may signal churn risk.